Simulation Studies for Evaluating ForestSearch Performance

Operating Characteristics and Power Analysis

Author

ForestSearch Package

Published

February 7, 2026

1 Introduction

This vignette demonstrates how to conduct simulation studies to evaluate the performance of ForestSearch for identifying subgroups with differential treatment effects. The simulation framework allows you to:

  • Generate synthetic clinical trial data with known treatment effect heterogeneity
  • Evaluate subgroup identification rates (power)
  • Assess classification accuracy (sensitivity, specificity, PPV, NPV)
  • Compare different analysis methods (ForestSearch, GRF)
  • Estimate Type I error under null hypothesis
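
As a concrete illustration of the classification metrics listed above, the sketch below computes sensitivity, specificity, PPV, and NPV from 0/1 membership indicators. The vectors truth and pred are hypothetical stand-ins for the true harm-subgroup flag (flag.harm in the simulated data) and a fitted subgroup assignment:

# Hypothetical indicators: true harm-subgroup membership vs. identified membership
truth <- c(1, 1, 1, 0, 0, 0, 0, 0)
pred  <- c(1, 1, 0, 1, 0, 0, 0, 0)

tp <- sum(pred == 1 & truth == 1)   # true positives
fp <- sum(pred == 1 & truth == 0)   # false positives
fn <- sum(pred == 0 & truth == 1)   # false negatives
tn <- sum(pred == 0 & truth == 0)   # true negatives

sens <- tp / (tp + fn)   # sensitivity: P(identified | in H)
spec <- tn / (tn + fp)   # specificity: P(not identified | in H^c)
ppv  <- tp / (tp + fp)   # positive predictive value
npv  <- tn / (tn + fn)   # negative predictive value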

1.1 Simulation Framework Overview

The simulation workflow consists of four main steps:

flowchart LR
    A[Create DGM] --> B[Simulate Trials]
    B --> C[Run Analyses]
    C --> D[Summarize Results]

  1. Create DGM: Define a data generating mechanism with specified treatment effects
  2. Simulate Trials: Generate multiple simulated datasets
  3. Run Analyses: Apply ForestSearch (and optionally GRF) to each dataset
  4. Summarize Results: Aggregate operating characteristics across simulations
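The four steps map directly onto the functions used later in this vignette. A minimal serial sketch (argument values are illustrative, and fs_params refers to the analysis parameter list defined in Section 5.2; the parallel foreach version appears in Section 5.3):

# Step 1: define the data generating mechanism
dgm <- create_gbsg_dgm(model = "alt", k_inter = 2.0)

# Steps 2-3: simulate trials and analyze each one (serial loop for clarity)
results <- do.call(rbind, lapply(1:10, function(sim) {
  run_simulation_analysis(sim_id = sim, dgm = dgm, n_sample = 700,
                          confounders_base = c("z1", "z2", "z3", "z4", "z5"),
                          run_fs = TRUE, fs_params = fs_params)
}))

# Step 4: aggregate operating characteristics (power, sensitivity, ...) across rows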

1.2 Key Updates in This Version

The simulation framework has been aligned with generate_aft_dgm_flex() methodology:

Feature                        Description
Individual Potential Outcomes  theta_0, theta_1, loghr_po columns
Average Hazard Ratios (AHR)    Alternative to Cox-based HR estimation
Stacked PO for Cox HR          Same epsilon for causal inference
use_twostage Parameter         Faster exploratory analysis option
Backward Compatible            Works with old and new DGM formats

2 Setup

# Core packages
library(forestsearch)
library(weightedsurv)

library(data.table)
library(survival)
library(ggplot2)
library(gt)

# Parallel processing
library(foreach)
library(doFuture)
library(future)

# Source simulation functions (if not yet in package)
# source("sim_aft_gbsg_refactored.R")
# source("oc_analyses_gbsg.R")

3 Creating a Data Generating Mechanism

The simulation framework uses the German Breast Cancer Study Group (GBSG) dataset as a template for realistic covariate distributions and censoring patterns.

3.1 Understanding the DGM

The create_gbsg_dgm() function creates a data generating mechanism (DGM) based on an Accelerated Failure Time (AFT) model with Weibull distribution. Key features:

  • Covariates: Age, estrogen receptor, menopausal status, progesterone receptor, nodes
  • Treatment effect heterogeneity: Specified via interaction terms
  • Subgroup definition: H = {low estrogen receptor AND premenopausal}
  • Censoring: Weibull or uniform censoring model
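
The harm subgroup H is a simple conjunction of two derived binary covariates. Given a data frame in the DGM's format, the indicator can be reconstructed directly (following the vignette's convention that z1 flags low estrogen receptor and z3 flags premenopausal status):

# df_super: a super-population data frame in the DGM format
# z1 = 1 if estrogen receptor <= threshold (25th percentile), z3 = 1 if premenopausal
flag_harm <- as.integer(df_super$z1 == 1 & df_super$z3 == 1)

# Subgroup prevalence, as reported by print(dgm)
mean(flag_harm)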

3.1.1 New Output Structure (Aligned with generate_aft_dgm_flex)

The DGM now includes:

dgm$hazard_ratios <- list(
  overall          = hr_causal,     # Cox-based overall HR
  AHR              = AHR,           # Average HR from loghr_po
  AHR_harm         = AHR_H_true,    # AHR in harm subgroup
  AHR_no_harm      = AHR_Hc_true,   # AHR in complement
  harm_subgroup    = hr_H_true,     # Cox-based HR in H
  no_harm_subgroup = hr_Hc_true     # Cox-based HR in Hc
)

The super-population data (dgm$df_super_rand) now contains:

Column    Description
theta_0   Log-hazard contribution under control
theta_1   Log-hazard contribution under treatment
loghr_po  Individual causal log hazard ratio (theta_1 - theta_0)

3.2 Alternative Hypothesis (Heterogeneous Treatment Effect)

Under the alternative hypothesis, we create a DGM where the treatment effect varies across patient subgroups:

# Create DGM with heterogeneous treatment effect
# HR in harm subgroup (H) will be > 1 (treatment harmful)
# HR in complement (H^c) will be < 1 (treatment beneficial)

dgm_alt <- create_gbsg_dgm(
  model = "alt",
  k_treat = 1.0,
  k_inter = 2.0,      # Interaction effect multiplier
  k_z3 = 1.0,
  z1_quantile = 0.25, # ER threshold at 25th percentile
  n_super = 5000,
  cens_type = "weibull",
  seed = 8316951,
  verbose = TRUE
)

# Examine the DGM (print method now shows both HR and AHR metrics)
print(dgm_alt)
GBSG-Based AFT Data Generating Mechanism (Aligned)
===================================================

Model type: alt 
Super-population size: 5000 

Effect Modifiers:
  k_treat: 1 
  k_inter: 2 
  k_z3: 1 

Hazard Ratios (Cox-based, stacked PO):
  Overall (causal): 0.7331 
  Harm subgroup (H): 2.9651 
  Complement (Hc): 0.6612 
  Ratio HR(H)/HR(Hc): 4.4846 

Average Hazard Ratios (from loghr_po):
  AHR (overall): 0.7431 
  AHR_harm (H): 3.8687 
  AHR_no_harm (Hc): 0.5848 
  Ratio AHR(H)/AHR(Hc): 6.6157 

Subgroup definition: z1 == 1 & z3 == 1 (low ER & premenopausal) 
ER threshold: 8 (quantile = 0.25)
Subgroup size: 634 (12.7%)
Analysis variables: v1, v2, v3, v4, v5, v6, v7 

3.2.1 Accessing Hazard Ratios (New Aligned Format)

# Traditional access (backward compatible)
cat("Cox-based HRs:\n")
Cox-based HRs:
cat("  HR(H):", round(dgm_alt$hr_H_true, 4), "\n")
  HR(H): 2.9651 
cat("  HR(Hc):", round(dgm_alt$hr_Hc_true, 4), "\n")
  HR(Hc): 0.6612 
cat("  HR(overall):", round(dgm_alt$hr_causal, 4), "\n")
  HR(overall): 0.7331 
# New AHR metrics (aligned with generate_aft_dgm_flex)
cat("\nAverage Hazard Ratios (from loghr_po):\n")

Average Hazard Ratios (from loghr_po):
cat("  AHR(H):", round(dgm_alt$AHR_H_true, 4), "\n")
  AHR(H): 3.8687 
cat("  AHR(Hc):", round(dgm_alt$AHR_Hc_true, 4), "\n")
  AHR(Hc): 0.5848 
cat("  AHR(overall):", round(dgm_alt$AHR, 4), "\n")
  AHR(overall): 0.7431 
# Using hazard_ratios list (unified access)
cat("\nVia hazard_ratios list:\n")

Via hazard_ratios list:
cat("  harm_subgroup:", round(dgm_alt$hazard_ratios$harm_subgroup, 4), "\n")
  harm_subgroup: 2.9651 
cat("  AHR_harm:", round(dgm_alt$hazard_ratios$AHR_harm, 4), "\n")
  AHR_harm: 3.8687 

3.2.2 Examining Individual-Level Treatment Effects

# The super-population now includes individual log hazard ratios
df_super <- dgm_alt$df_super_rand

cat("Individual-level potential outcomes:\n")
Individual-level potential outcomes:
cat("  theta_0 (control log-hazard) range:", 
    round(range(df_super$theta_0), 3), "\n")
  theta_0 (control log-hazard) range: -0.891 1.76 
cat("  theta_1 (treatment log-hazard) range:", 
    round(range(df_super$theta_1), 3), "\n")
  theta_1 (treatment log-hazard) range: -1.427 2.909 
cat("  loghr_po (individual log-HR) range:", 
    round(range(df_super$loghr_po), 3), "\n")
  loghr_po (individual log-HR) range: -0.537 1.353 
# Verify AHR calculation
ahr_manual <- exp(mean(df_super$loghr_po))
cat("\nAHR verification:\n")

AHR verification:
cat("  exp(mean(loghr_po)) =", round(ahr_manual, 4), "\n")
  exp(mean(loghr_po)) = 0.7431 
cat("  dgm$AHR =", round(dgm_alt$AHR, 4), "\n")
  dgm$AHR = 0.7431 
# Distribution of individual treatment effects
cat("\nIndividual HR distribution:\n")

Individual HR distribution:
individual_hr <- exp(df_super$loghr_po)
cat("  Mean:", round(mean(individual_hr), 4), "\n")
  Mean: 1.0012 
cat("  Median:", round(median(individual_hr), 4), "\n")
  Median: 0.5848 
cat("  SD:", round(sd(individual_hr), 4), "\n")
  SD: 1.0928 

3.3 Calibrating for a Target Hazard Ratio

Often, you want to specify the exact hazard ratio in the harm subgroup. Use calibrate_k_inter() to find the interaction parameter that achieves this.
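
Conceptually, calibration is a one-dimensional root-finding problem: find the k_inter at which the realized HR(H) equals the target. A hedged sketch of that idea using uniroot() follows; the actual calibrate_k_inter() implementation may use a different search strategy, and the search interval and argument defaults here are assumptions:

# Sketch only: gap between achieved HR(H) and the target, as a function of k_inter
hr_gap <- function(k, target = 2.0) {
  dgm <- create_gbsg_dgm(model = "alt", k_treat = 1.0, k_inter = k, verbose = FALSE)
  dgm$hr_H_true - target            # zero when HR(H) hits the target
}

# Root-finding over a plausible range of interaction strengths
k_star <- uniroot(hr_gap, interval = c(0, 4))$root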

3.3.1 Calibrate to Cox-based HR (Default)

# Find k_inter for Cox-based HR = 2.0 in harm subgroup
k_inter_cox <- calibrate_k_inter(
  target_hr_harm = 2.0,
  model = "alt",
  k_treat = 1.0,
  cens_type = "weibull",
  use_ahr = FALSE,  # Default: calibrate to Cox-based HR
  verbose = TRUE
)

# Create DGM with calibrated k_inter
dgm_calibrated_cox <- create_gbsg_dgm(
  model = "alt",
  k_treat = 1.0,
  k_inter = k_inter_cox,
  verbose = TRUE
)

cat("\nVerification (Cox-based):\n")

Verification (Cox-based):
cat("Achieved HR(H):", round(dgm_calibrated_cox$hr_H_true, 3), "\n")
Achieved HR(H): 2 
cat("HR(H^c):", round(dgm_calibrated_cox$hr_Hc_true, 3), "\n")
HR(H^c): 0.661 
cat("Overall HR:", round(dgm_calibrated_cox$hr_causal, 3), "\n")
Overall HR: 0.722 

3.3.2 Calibrate to AHR (New Option)

# Alternatively, calibrate to Average Hazard Ratio
k_inter_ahr <- calibrate_k_inter(
  target_hr_harm = 2.0,
  model = "alt",
  k_treat = 1.0,
  cens_type = "weibull",
  use_ahr = TRUE,  # NEW: calibrate to AHR instead
  verbose = TRUE
)

# Create DGM with AHR-calibrated k_inter
dgm_calibrated_ahr <- create_gbsg_dgm(
  model = "alt",
  k_treat = 1.0,
  k_inter = k_inter_ahr,
  verbose = TRUE
)

cat("\nVerification (AHR-based):\n")

Verification (AHR-based):
cat("Achieved AHR(H):", round(dgm_calibrated_ahr$AHR_H_true, 3), "\n")
Achieved AHR(H): 2 
cat("AHR(H^c):", round(dgm_calibrated_ahr$AHR_Hc_true, 3), "\n")
AHR(H^c): 0.585 
cat("Overall AHR:", round(dgm_calibrated_ahr$AHR, 3), "\n")
Overall AHR: 0.683 

3.3.3 Compare Cox HR vs AHR Calibration

# Compare the two calibration approaches
cat("Comparison of calibration methods:\n")
Comparison of calibration methods:
cat(sprintf("%-20s %-12s %-12s\n", "Metric", "Cox-calib", "AHR-calib"))
Metric               Cox-calib    AHR-calib   
cat(sprintf("%-20s %-12.4f %-12.4f\n", "k_inter", k_inter_cox, k_inter_ahr))
k_inter              1.4947       1.3016      
cat(sprintf("%-20s %-12.4f %-12.4f\n", "HR(H)", 
            dgm_calibrated_cox$hr_H_true, dgm_calibrated_ahr$hr_H_true))
HR(H)                2.0000       1.7233      
cat(sprintf("%-20s %-12.4f %-12.4f\n", "AHR(H)", 
            dgm_calibrated_cox$AHR_H_true, dgm_calibrated_ahr$AHR_H_true))
AHR(H)               2.4001       2.0000      

3.4 Validating k_inter Effect on Heterogeneity

Use validate_k_inter_effect() to verify the interaction parameter properly modulates treatment effect heterogeneity:

# Test k_inter effect on HR heterogeneity
# k_inter = 0 should give ratio ≈ 1 (no heterogeneity)
validation_results <- validate_k_inter_effect(
  k_inter_values = c(-2, -1, 0, 1, 2, 3),
  verbose = TRUE
)
Testing k_inter effect on HR heterogeneity...

k_inter  HR(H)    HR(Hc)   AHR(H)   AHR(Hc)  Ratio(Cox) Ratio(AHR)
---------------------------------------------------------------------- 
-2.0     0.1336   0.6612   0.0884   0.5848   0.2021     0.1512    
-1.0     0.3033   0.6612   0.2274   0.5848   0.4587     0.3888    
0.0      0.6552   0.6612   0.5848   0.5848   0.9909     1.0000    
1.0      1.3873   0.6612   1.5041   0.5848   2.0982     2.5721    
2.0      2.9651   0.6612   3.8687   0.5848   4.4846     6.6157    
3.0      6.6375   0.6612   9.9507   0.5848   10.0387    17.0162   

PASS: k_inter = 0 gives Cox ratio ~= 1 (no heterogeneity)
PASS: k_inter = 0 gives AHR ratio ~= 1 (no heterogeneity)

AHR vs Cox HR alignment:
  k_inter = -2.0: HR(H) vs AHR(H) diff = 0.0452
  k_inter = -1.0: HR(H) vs AHR(H) diff = 0.0759
  k_inter = 0.0: HR(H) vs AHR(H) diff = 0.0704
  k_inter = 1.0: HR(H) vs AHR(H) diff = 0.1168
  k_inter = 2.0: HR(H) vs AHR(H) diff = 0.9036
  k_inter = 3.0: HR(H) vs AHR(H) diff = 3.3132

3.5 Null Hypothesis (Uniform Treatment Effect)

For Type I error evaluation, create a DGM with uniform treatment effect:

# Create null DGM (no treatment effect heterogeneity)
dgm_null <- create_gbsg_dgm(
  model = "null",
  k_treat = 1.0,
  verbose = TRUE
)

cat("\nNull hypothesis HRs:\n")

Null hypothesis HRs:
cat("Overall HR:", round(dgm_null$hr_causal, 3), "\n")
Overall HR: 0.722 
cat("HR(H^c):", round(dgm_null$hr_Hc_true, 3), "\n")
HR(H^c): 0.722 
cat("AHR(H^c):", round(dgm_null$AHR_Hc_true, 3), "\n")
AHR(H^c): 0.654 
cat("AHR:", round(dgm_null$AHR, 3), "\n")
AHR: 0.654 

4 Simulating Trial Data

4.1 Single Trial Simulation

Use simulate_from_gbsg_dgm() to generate a single simulated trial:

# Use the Cox-calibrated DGM for simulations
dgm_calibrated <- dgm_calibrated_cox

# Simulate a single trial
sim_data <- simulate_from_gbsg_dgm(
  dgm = dgm_calibrated,
  n = 700,
  rand_ratio = 1,        # 1:1 randomization
  sim_id = 1,
  max_follow = 84,       # 84 months administrative censoring
  muC_adj = log(1.5)     # Censoring adjustment
)

# Examine the data
cat("Simulated trial:\n")
Simulated trial:
cat("  N =", nrow(sim_data), "\n")
  N = 700 
cat("  Events =", sum(sim_data$event.sim), 
    "(", round(100 * mean(sim_data$event.sim), 1), "%)\n")
  Events = 376 ( 53.7 %)
cat("  Harm subgroup size =", sum(sim_data$flag.harm),
    "(", round(100 * mean(sim_data$flag.harm), 1), "%)\n")
  Harm subgroup size = 86 ( 12.3 %)
# Quick survival analysis
fit_itt <- coxph(Surv(y.sim, event.sim) ~ treat, data = sim_data)
cat("  Estimated ITT HR =", round(exp(coef(fit_itt)), 3), "\n")
  Estimated ITT HR = 0.741 

4.1.1 Examining Individual-Level Effects in Simulated Data

# The simulated data now includes loghr_po
if ("loghr_po" %in% names(sim_data)) {
  cat("\nIndividual treatment effects in simulated trial:\n")
  
  # Compute AHR in simulated data by subgroup
  ahr_H_sim <- exp(mean(sim_data$loghr_po[sim_data$flag.harm == 1]))
  ahr_Hc_sim <- exp(mean(sim_data$loghr_po[sim_data$flag.harm == 0]))
  ahr_overall_sim <- exp(mean(sim_data$loghr_po))
  
  cat("  AHR(H) in sim:", round(ahr_H_sim, 3), "\n")
  cat("  AHR(Hc) in sim:", round(ahr_Hc_sim, 3), "\n")
  cat("  AHR(overall) in sim:", round(ahr_overall_sim, 3), "\n")
} else {
  cat("\nNote: loghr_po not available in simulated data\n")
}

Individual treatment effects in simulated trial:
  AHR(H) in sim: 2.4 
  AHR(Hc) in sim: 0.585 
  AHR(overall) in sim: 0.696 

4.2 Examining Covariate Structure

dfcount <- df_counting(
  df = sim_data,
  by.risk = 6,
  tte.name = "y.sim", 
  event.name = "event.sim", 
  treat.name = "treat"
)
plot_weighted_km(dfcount, conf.int = TRUE, show.logrank = TRUE, 
                 ymax = 1.05, xmed.fraction = 0.775, ymed.offset = 0.125)

create_summary_table(
  data = sim_data, 
  treat_var = "treat", 
  table_title = "Characteristics by Treatment Arm",
  vars_continuous = c("z1", "z2", "size", "z3", "z4", "z5"),
  vars_categorical = c("flag.harm", "grade3"),
  font_size = 12
)
Characteristics by Treatment Arm
Characteristic    Control (n=350)  Treatment (n=350)  P-value1  SMD2
z1 Mean (SD)      0.3 (0.4)        0.2 (0.4)          0.729     0.03
z2 Mean (SD)      2.4 (1.1)        2.5 (1.1)          0.157     0.11
size Mean (SD)    29.2 (14.3)      30.1 (17.0)        0.461     0.06
z3 Mean (SD)      0.4 (0.5)        0.4 (0.5)          0.249     0.09
z4 Mean (SD)      2.6 (1.1)        2.5 (1.1)          0.428     0.06
z5 Mean (SD)      2.4 (1.1)        2.4 (1.1)          0.563     0.04
flag.harm                                             0.908     0.00
  0               306 (87.4%)      308 (88.0%)
  1               44 (12.6%)       42 (12.0%)
grade3                                                0.530     0.02
  0               273 (78.0%)      265 (75.7%)
  1               77 (22.0%)       85 (24.3%)
1 P-values: t-test for continuous, chi-square/Fisher's exact for categorical/binary variables
2 SMD = Standardized mean difference (Cohen's d for continuous, Cramer's V for categorical)

5 Running Simulation Studies

5.1 Setting Up Parallel Processing

For efficient simulation studies, use parallel processing:

# Configure parallel backend
n_workers <- min(parallel::detectCores() - 1, 120)

plan(multisession, workers = n_workers)
registerDoFuture()

cat("Using", n_workers, "parallel workers\n")
Using 13 parallel workers

5.2 Define Simulation Parameters

# Simulation settings
sim_config_alt <- list(
  n_sims = 30,          # Number of simulations (use 500-1000 for final)
  n_sample = 700,         # Sample size per trial
  max_follow = 84,        # Maximum follow-up (months)
  seed_base = 8316951,
  muC_adj = log(1.5)
)

sim_config_null <- list(
  n_sims = 30,          # Number of simulations (use 1000+ for final Type I error estimation)
  n_sample = 700,         # Sample size per trial
  max_follow = 84,        # Maximum follow-up (months)
  seed_base = 8316951,
  muC_adj = log(1.5)
)

# ForestSearch parameters (now includes use_twostage)
fs_params <- list(
  outcome.name = "y.sim",
  event.name = "event.sim",
  treat.name = "treat",
  id.name = "id",
  use_lasso = TRUE,
  use_grf = TRUE,
  hr.threshold = 1.25,
  hr.consistency = 1.0,
  pconsistency.threshold = 0.90,
  fs.splits = 400,
  n.min = 60,
  d0.min = 12,
  d1.min = 12,
  maxk = 2,
  by.risk = 12,
  vi.grf.min = -0.2,
  # NEW: Two-stage algorithm option
  use_twostage = TRUE,      # Set TRUE for faster exploratory analysis
  twostage_args = list()     # Optional tuning parameters
)


# Confounders for analysis
confounders_base <- c("z1", "z2", "z3", "z4", "z5", "size", "grade3")

5.2.1 Two-Stage Algorithm Option

The use_twostage parameter enables a faster two-stage search algorithm:

# Fast configuration with two-stage algorithm
fs_params_fast <- modifyList(fs_params, list(
  use_twostage = TRUE,
  twostage_args = list(
    n.splits.screen = 30,    # Stage 1 screening splits
    batch.size = 20,         # Stage 2 batch size
    conf.level = 0.95        # Early stopping confidence
  )
))

cat("Standard search: use_twostage =", fs_params$use_twostage, "\n")
Standard search: use_twostage = TRUE 
cat("Fast search: use_twostage =", fs_params_fast$use_twostage, "\n")
Fast search: use_twostage = TRUE 
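
The two-stage idea, loosely: a small number of screening splits filters candidates cheaply, and each surviving candidate's consistency proportion (Pcons) is then refined in batches of splits, stopping early once it clears the threshold. A hedged sketch of the batchwise decision, not the package's actual implementation (split_consistent() is a hypothetical helper returning TRUE when a random split reproduces the candidate's qualifying hazard ratio; the real algorithm also uses conf.level for a confidence-based stopping rule):

# Sketch: evaluate a candidate's consistency in batches with early stopping
evaluate_candidate <- function(candidate, threshold = 0.90,
                               max_splits = 400, batch_size = 20) {
  hits <- 0; n <- 0
  while (n < max_splits) {
    hits <- hits + sum(replicate(batch_size, split_consistent(candidate)))
    n <- n + batch_size
    pcons <- hits / n
    if (pcons >= threshold)                      # early stop: threshold cleared
      return(list(pass = TRUE, pcons = pcons, splits = n))
  }
  list(pass = FALSE, pcons = pcons, splits = n)  # budget exhausted
}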

5.3 Running Alternative Hypothesis Simulations

cat("Running", sim_config_alt$n_sims, "simulations under H1...\n")
Running 30 simulations under H1...
start_time <- Sys.time()

results_alt <- foreach(
  sim = 1:sim_config_alt$n_sims,
  .combine = rbind,
  .errorhandling = "remove",
  .options.future = list(
    packages = c("forestsearch", "survival", "data.table"),
    seed = TRUE
  )
) %dofuture% {
  run_simulation_analysis(
    sim_id = sim,
    dgm = dgm_calibrated,
    n_sample = sim_config_alt$n_sample,
    max_follow = sim_config_alt$max_follow,
    muC_adj = sim_config_alt$muC_adj,
    confounders_base = confounders_base,
    cox_formula_adj = survival::Surv(y.sim, event.sim) ~ treat + z1 + z2 + z3,
    n_add_noise = 0L,
    run_fs = TRUE,
    run_fs_grf = FALSE,
    run_grf = TRUE,
    fs_params = fs_params,
    verbose = TRUE,
    debug = FALSE,
    verbose_n = 3  # Only print first 3 simulations
  )
}

=== Two-Stage Consistency Evaluation Enabled ===
Stage 1 screening splits: 30 
Maximum total splits: 400 
Batch size: 20 
================================================

GRF stage for cut selection with dmin, tau = 12 0.6 
tau, maxdepth = 48.53742 2 
   leaf.node control.mean control.size control.se depth
1          2        -5.33       520.00       1.11     1
2          3         1.02       180.00       2.32     1
11         4        -6.71       426.00       1.21     2
21         5         3.44       142.00       2.62     2
4          7        -7.11        78.00       3.18     2
GRF subgroup NOT found
GRF cuts identified: 0 
# of continuous/categorical characteristics 4 3 
Continuous characteristics: z2 z4 z5 size 
Categorical characteristics: z1 z3 grade3 
## Prior to lasso: z2 z4 z5 size 
#### Lasso selection results 
7 x 1 sparse Matrix of class "dgCMatrix"
                s0
z1      0.28638272
z2      .         
z3      0.04573174
z4     -0.34834939
z5      0.45235142
size    .         
grade3  0.01493646
Cox-LASSO selected: z1 z3 z4 z5 grade3 
Cox-LASSO not selected: z2 size 
### End Lasso selection 
## After lasso: z4 z5 
Default cuts included from Lasso: z4 <= mean(z4) z4 <= median(z4) z4 <= qlow(z4) z4 <= qhigh(z4) z5 <= mean(z5) z5 <= median(z5) z5 <= qlow(z5) z5 <= qhigh(z5) 
Categorical after Lasso: z1 z3 grade3 
Factors per GRF:  
Initial GRF cuts included  
Factors included per GRF (not in lasso)  

===== CONSOLIDATED CUT EVALUATION (IMPROVED) =====
Evaluating 10 cut expressions once and caching...
Cut evaluation summary:
  Total cuts:  10 
  Valid cuts:  10 
  Errors:  0 
✓ All 10 factors validated as 0/1
===== END CONSOLIDATED CUT EVALUATION =====

# of candidate subgroup factors= 10 
 [1] "z4 <= 2.5" "z4 <= 3"   "z4 <= 2"   "z5 <= 2.4" "z5 <= 2"   "z5 <= 1"  
 [7] "z5 <= 3"   "z1"        "z3"        "grade3"   
Number of possible configurations (<= maxk): maxk = 2 , # combinations = 210 
Events criteria: control >= 12 , treatment >= 12 
Sample size criteria: n >= 60 
Subgroup search completed in 0.02 minutes

--- Filtering Summary ---
  Combinations evaluated: 210 
  Passed variance check: 189 
  Passed prevalence (>= 0.025 ): 189 
  Passed redundancy check: 178 
  Passed event counts (d0>= 12 , d1>= 12 ): 160 
  Passed sample size (n>= 60 ): 158 
  Cox model fit successfully: 158 
  Passed HR threshold (>= 1.25 ): 3 
-------------------------

Found 3 subgroup candidate(s)
# of candidate subgroups (meeting all criteria) = 3 
Random seed set to: 8316951 
Two-stage parameters:
  n.splits.screen: 30 
  screen.threshold: 0.763 
  batch.size: 20 
  conf.level: 0.95 
Removed 1 near-duplicate subgroups
# of unique initial candidates: 2 
# Restricting to top stop_Kgroups = 10 
# of candidates to evaluate: 2 
# Early stop threshold: 0.95 
Parallel config: workers = 6 , batch_size = 1 
Batch 1 / 2 : candidates 1 - 1 
Batch 2 / 2 : candidates 2 - 2 
Evaluated 2 of 2 candidates (complete) 
1 subgroups passed consistency threshold
SG focus = hr 
Seconds and minutes forestsearch overall = 10.688 0.1781 
Consistency algorithm used: twostage 
tau, maxdepth = 48.53742 2 
   leaf.node control.mean control.size control.se depth
1          2        -5.33       520.00       1.11     1
2          3         1.02       180.00       2.32     1
11         4        -6.71       426.00       1.21     2
21         5         3.44       142.00       2.62     2
4          7        -7.11        78.00       3.18     2
GRF subgroup NOT found

=== Two-Stage Consistency Evaluation Enabled ===
Stage 1 screening splits: 30 
Maximum total splits: 400 
Batch size: 20 
================================================

GRF stage for cut selection with dmin, tau = 12 0.6 
tau, maxdepth = 49.03326 2 
  leaf.node control.mean control.size control.se depth
1         2        -5.56       534.00       1.18     1
2         3         2.39       166.00       2.32     1
3         4        -5.10       262.00       1.63     2
4         5        10.27        96.00       2.89     2
5         6        -7.05       336.00       1.52     2

Selected subgroup:
  leaf.node control.mean control.size control.se depth
4         5        10.27        96.00       2.89     2

GRF subgroup found
Terminating node at max.diff (sg.harm.id):
[1] "z1 <= 0"

All splits (from all trees):
[1] "z1 <= 0"    "z2 <= 2"    "size <= 58"
GRF cuts identified: 3 
  Cuts: z1 <= 0, z2 <= 2, size <= 58 
  Selected tree depth: 2 
# of continuous/categorical characteristics 4 3 
Continuous characteristics: z2 z4 z5 size 
Categorical characteristics: z1 z3 grade3 
## Prior to lasso: z2 z4 z5 size 
#### Lasso selection results 
7 x 1 sparse Matrix of class "dgCMatrix"
               s0
z1      .        
z2      .        
z3      .        
z4     -0.2924157
z5      0.4106361
size    .        
grade3  .        
Cox-LASSO selected: z4 z5 
Cox-LASSO not selected: z1 z2 z3 size grade3 
### End Lasso selection 
## After lasso: z4 z5 
Default cuts included from Lasso: z4 <= mean(z4) z4 <= median(z4) z4 <= qlow(z4) z4 <= qhigh(z4) z5 <= mean(z5) z5 <= median(z5) z5 <= qlow(z5) z5 <= qhigh(z5) 
Categorical after Lasso: 
Factors per GRF: z1 <= 0 z2 <= 2 size <= 58 
Initial GRF cuts included z1 <= 0 z2 <= 2 size <= 58 
Factors included per GRF (not in lasso) z1 <= 0 z2 <= 2 size <= 58 

===== CONSOLIDATED CUT EVALUATION (IMPROVED) =====
Evaluating 10 cut expressions once and caching...
Cut evaluation summary:
  Total cuts:  10 
  Valid cuts:  10 
  Errors:  0 
✓ All 10 factors validated as 0/1
===== END CONSOLIDATED CUT EVALUATION =====

# of candidate subgroup factors= 10 
 [1] "z2 <= 2"    "size <= 58" "z4 <= 2.5"  "z4 <= 3"    "z4 <= 2"   
 [6] "z5 <= 2.5"  "z5 <= 2"    "z5 <= 1.8"  "z5 <= 3"    "z1 <= 0"   
Number of possible configurations (<= maxk): maxk = 2 , # combinations = 210 
Events criteria: control >= 12 , treatment >= 12 
Sample size criteria: n >= 60 
Subgroup search completed in 0.01 minutes

--- Filtering Summary ---
  Combinations evaluated: 210 
  Passed variance check: 189 
  Passed prevalence (>= 0.025 ): 189 
  Passed redundancy check: 174 
  Passed event counts (d0>= 12 , d1>= 12 ): 143 
  Passed sample size (n>= 60 ): 142 
  Cox model fit successfully: 142 
  Passed HR threshold (>= 1.25 ): 9 
-------------------------

Found 9 subgroup candidate(s)
# of candidate subgroups (meeting all criteria) = 9 
Random seed set to: 8316951 
Two-stage parameters:
  n.splits.screen: 30 
  screen.threshold: 0.763 
  batch.size: 20 
  conf.level: 0.95 
Removed 3 near-duplicate subgroups
# of unique initial candidates: 6 
# Restricting to top stop_Kgroups = 10 
# of candidates to evaluate: 6 
# Early stop threshold: 0.95 
Parallel config: workers = 6 , batch_size = 1 
Batch 1 / 6 : candidates 1 - 1 

==================================================
EARLY STOP TRIGGERED (batch 1 )
  Candidate: 1 of 6 
  Pcons: 1 >= 0.95 
==================================================

Evaluated 1 of 6 candidates (early stop) 
1 subgroups passed consistency threshold
SG focus = hr 
Seconds and minutes forestsearch overall = 3.134 0.0522 
Consistency algorithm used: twostage 
tau, maxdepth = 49.03326 2 
  leaf.node control.mean control.size control.se depth
1         2        -5.56       534.00       1.18     1
2         3         2.39       166.00       2.32     1
3         4        -5.10       262.00       1.63     2
4         5        10.27        96.00       2.89     2
5         6        -7.05       336.00       1.52     2

Selected subgroup:
  leaf.node control.mean control.size control.se depth
4         5        10.27        96.00       2.89     2

GRF subgroup found
Terminating node at max.diff (sg.harm.id):
[1] "z1 <= 0"

All splits (from all trees):
[1] "z1 <= 0"    "z2 <= 2"    "size <= 58"

=== Two-Stage Consistency Evaluation Enabled ===
Stage 1 screening splits: 30 
Maximum total splits: 400 
Batch size: 20 
================================================

GRF stage for cut selection with dmin, tau = 12 0.6 
tau, maxdepth = 49.57713 2 
   leaf.node control.mean control.size control.se depth
1          2        -6.22       494.00       1.21     1
2          3         5.44       206.00       2.11     1
21         5        -6.09       123.00       2.07     2
3          6        -6.92       385.00       1.46     2
4          7         7.19       143.00       2.55     2

Selected subgroup:
  leaf.node control.mean control.size control.se depth
4         7         7.19       143.00       2.55     2

GRF subgroup found
Terminating node at max.diff (sg.harm.id):
[1] "z1 <= 0"

All splits (from all trees):
[1] "z1 <= 0"    "z5 <= 1"    "size <= 18"
GRF cuts identified: 3 
  Cuts: z1 <= 0, z5 <= 1, size <= 18 
  Selected tree depth: 2 
# of continuous/categorical characteristics 4 3 
Continuous characteristics: z2 z4 z5 size 
Categorical characteristics: z1 z3 grade3 
## Prior to lasso: z2 z4 z5 size 
#### Lasso selection results 
7 x 1 sparse Matrix of class "dgCMatrix"
                s0
z1      .         
z2      .         
z3      0.07273592
z4     -0.40237672
z5      0.35008925
size    .         
grade3  .         
Cox-LASSO selected: z3 z4 z5 
Cox-LASSO not selected: z1 z2 size grade3 
### End Lasso selection 
## After lasso: z4 z5 
Default cuts included from Lasso: z4 <= mean(z4) z4 <= median(z4) z4 <= qlow(z4) z4 <= qhigh(z4) z5 <= mean(z5) z5 <= median(z5) z5 <= qlow(z5) z5 <= qhigh(z5) 
Categorical after Lasso: z3 
Factors per GRF: z1 <= 0 z5 <= 1 size <= 18 
Initial GRF cuts included z1 <= 0 z5 <= 1 size <= 18 
Factors included per GRF (not in lasso) z1 <= 0 z5 <= 1 size <= 18 

===== CONSOLIDATED CUT EVALUATION (IMPROVED) =====
Evaluating 11 cut expressions once and caching...
Cut evaluation summary:
  Total cuts:  11 
  Valid cuts:  10 
  Errors:  0 
Dropping variables (cut only has 1 level): z4 <= 4 
Total cuts after dropping invalid: 10
✓ All 10 factors validated as 0/1
===== END CONSOLIDATED CUT EVALUATION =====

# of candidate subgroup factors= 10 
 [1] "z5 <= 1"    "size <= 18" "z4 <= 2.5"  "z4 <= 2"    "z4 <= 1"   
 [6] "z5 <= 2.5"  "z5 <= 2"    "z5 <= 3"    "z3"         "z1 <= 0"   
Number of possible configurations (<= maxk): maxk = 2 , # combinations = 210 
Events criteria: control >= 12 , treatment >= 12 
Sample size criteria: n >= 60 
Subgroup search completed in 0.01 minutes

--- Filtering Summary ---
  Combinations evaluated: 210 
  Passed variance check: 189 
  Passed prevalence (>= 0.025 ): 189 
  Passed redundancy check: 178 
  Passed event counts (d0>= 12 , d1>= 12 ): 162 
  Passed sample size (n>= 60 ): 154 
  Cox model fit successfully: 154 
  Passed HR threshold (>= 1.25 ): 14 
-------------------------

Found 14 subgroup candidate(s)
# of candidate subgroups (meeting all criteria) = 14 
Random seed set to: 8316951 
Two-stage parameters:
  n.splits.screen: 30 
  screen.threshold: 0.763 
  batch.size: 20 
  conf.level: 0.95 
Removed 4 near-duplicate subgroups
# of unique initial candidates: 10 
# Restricting to top stop_Kgroups = 10 
# of candidates to evaluate: 10 
# Early stop threshold: 0.95 
Parallel config: workers = 6 , batch_size = 1 
Batch 1 / 10 : candidates 1 - 1 

==================================================
EARLY STOP TRIGGERED (batch 1 )
  Candidate: 1 of 10 
  Pcons: 1 >= 0.95 
==================================================

Evaluated 1 of 10 candidates (early stop) 
1 subgroups passed consistency threshold
SG focus = hr 
Seconds and minutes forestsearch overall = 1.93 0.0322 
Consistency algorithm used: twostage 
tau, maxdepth = 49.57713 2 
   leaf.node control.mean control.size control.se depth
1          2        -6.22       494.00       1.21     1
2          3         5.44       206.00       2.11     1
21         5        -6.09       123.00       2.07     2
3          6        -6.92       385.00       1.46     2
4          7         7.19       143.00       2.55     2

Selected subgroup:
  leaf.node control.mean control.size control.se depth
4         7         7.19       143.00       2.55     2

GRF subgroup found
Terminating node at max.diff (sg.harm.id):
[1] "z1 <= 0"

All splits (from all trees):
[1] "z1 <= 0"    "z5 <= 1"    "size <= 18"
runtime_alt <- difftime(Sys.time(), start_time, units = "mins")
cat("Completed in", round(runtime_alt, 1), "minutes\n")
Completed in 0.3 minutes
cat("Results:", nrow(results_alt), "rows\n")
Results: 60 rows
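
With results_alt in hand, step 4 of the workflow aggregates operating characteristics across simulations. A hedged sketch of that aggregation, assuming hypothetical column names (method, sg_found, sens, spec; check names(results_alt) against your version of run_simulation_analysis()):

library(data.table)
res <- as.data.table(results_alt)

# Power: proportion of simulations in which ForestSearch identified a subgroup
# (under the null DGM, the same quantity estimates Type I error)
power_fs <- res[method == "forestsearch", mean(sg_found)]

# Average classification accuracy among simulations where a subgroup was found
accuracy <- res[method == "forestsearch" & sg_found == 1,
                .(mean_sens = mean(sens), mean_spec = mean(spec))]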

5.4 Running Null Hypothesis Simulations

cat("Running", sim_config_null$n_sims, "simulations under H0...\n")
Running 30 simulations under H0...
start_time <- Sys.time()

results_null <- foreach(
  sim = 1:sim_config_null$n_sims,
  .combine = rbind,
  .errorhandling = "remove",
  .options.future = list(
    packages = c("forestsearch", "survival", "data.table"),
    seed = TRUE
  )
) %dofuture% {
  
  run_simulation_analysis(
    sim_id = sim,
    dgm = dgm_null,
    n_sample = sim_config_null$n_sample,
    max_follow = sim_config_null$max_follow,
    muC_adj = sim_config_null$muC_adj,
    confounders_base = confounders_base,
    cox_formula_adj = survival::Surv(y.sim, event.sim) ~ treat + z1 + z2 + z3,
    n_add_noise = 0L,
    run_fs = TRUE,
    run_fs_grf = FALSE,
    run_grf = TRUE,
    fs_params = fs_params,
    verbose = TRUE,
    verbose_n = 3  # Only print first 3 simulations

  )
}

=== Two-Stage Consistency Evaluation Enabled ===
Stage 1 screening splits: 30 
Maximum total splits: 400 
Batch size: 20 
================================================

GRF stage for cut selection with dmin, tau = 12 0.6 
tau, maxdepth = 47.91247 2 
  leaf.node control.mean control.size control.se depth
2         3        -4.78       695.00       1.00     1
1         4        -5.09       568.00       1.10     2
GRF subgroup NOT found
GRF cuts identified: 0 
# of continuous/categorical characteristics 4 3 
Continuous characteristics: z2 z4 z5 size 
Categorical characteristics: z1 z3 grade3 
## Prior to lasso: z2 z4 z5 size 
#### Lasso selection results 
7 x 1 sparse Matrix of class "dgCMatrix"
                  s0
z1      0.0078483334
z2      0.0324623135
z3      .           
z4     -0.3338944143
z5      0.4663620842
size    0.0002302206
grade3  .           
Cox-LASSO selected: z1 z2 z4 z5 size 
Cox-LASSO not selected: z3 grade3 
### End Lasso selection 
## After lasso: z2 z4 z5 size 
Default cuts included from Lasso: z2 <= mean(z2) z2 <= median(z2) z2 <= qlow(z2) z2 <= qhigh(z2) z4 <= mean(z4) z4 <= median(z4) z4 <= qlow(z4) z4 <= qhigh(z4) z5 <= mean(z5) z5 <= median(z5) z5 <= qlow(z5) z5 <= qhigh(z5) size <= mean(size) size <= median(size) size <= qlow(size) size <= qhigh(size) 
Categorical after Lasso: z1 
Factors per GRF:  
Initial GRF cuts included  
Factors included per GRF (not in lasso)  

===== CONSOLIDATED CUT EVALUATION (IMPROVED) =====
Evaluating 15 cut expressions once and caching...
Cut evaluation summary:
  Total cuts:  15 
  Valid cuts:  14 
  Errors:  0 
Dropping variables (cut only has 1 level): z2 <= 4 
Total cuts after dropping invalid: 14
✓ All 14 factors validated as 0/1
===== END CONSOLIDATED CUT EVALUATION =====

# of candidate subgroup factors= 14 
 [1] "z2 <= 2.5"    "z2 <= 2"      "z4 <= 2.5"    "z4 <= 3"      "z4 <= 2"     
 [6] "z5 <= 2.4"    "z5 <= 2"      "z5 <= 1"      "z5 <= 3"      "size <= 29.1"
[11] "size <= 25"   "size <= 20"   "size <= 35"   "z1"          
Number of possible configurations (<= maxk): maxk = 2 , # combinations = 406 
Events criteria: control >= 12 , treatment >= 12 
Sample size criteria: n >= 60 
Subgroup search completed in 0.03 minutes

--- Filtering Summary ---
  Combinations evaluated: 406 
  Passed variance check: 373 
  Passed prevalence (>= 0.025 ): 373 
  Passed redundancy check: 354 
  Passed event counts (d0>= 12 , d1>= 12 ): 327 
  Passed sample size (n>= 60 ): 316 
  Cox model fit successfully: 316 
  Passed HR threshold (>= 1.25 ): 4 
-------------------------

Found 4 subgroup candidate(s)
# of candidate subgroups (meeting all criteria) = 4 
Random seed set to: 8316951 
Two-stage parameters:
  n.splits.screen: 30 
  screen.threshold: 0.763 
  batch.size: 20 
  conf.level: 0.95 
Removed 2 near-duplicate subgroups
# of unique initial candidates: 2 
# Restricting to top stop_Kgroups = 10 
# of candidates to evaluate: 2 
# Early stop threshold: 0.95 
Parallel config: workers = 6 , batch_size = 1 
Batch 1 / 2 : candidates 1 - 1 
Batch 2 / 2 : candidates 2 - 2 
Evaluated 2 of 2 candidates (complete) 
No subgroups found meeting consistency threshold
Seconds and minutes forestsearch overall = 9.227 0.1538 
Consistency algorithm used: twostage 
tau, maxdepth = 47.91247 2 
  leaf.node control.mean control.size control.se depth
2         3        -4.78       695.00       1.00     1
1         4        -5.09       568.00       1.10     2
GRF subgroup NOT found

=== Two-Stage Consistency Evaluation Enabled ===
Stage 1 screening splits: 30 
Maximum total splits: 400 
Batch size: 20 
================================================

GRF stage for cut selection with dmin, tau = 12 0.6 
tau, maxdepth = 49.50766 2 
  leaf.node control.mean control.size control.se depth
1         2        -4.62       689.00       1.07     1
2         5        -6.72       124.00       2.32     2
3         6        -5.20       513.00       1.27     2
GRF subgroup NOT found
GRF cuts identified: 0 
# of continuous/categorical characteristics 4 3 
Continuous characteristics: z2 z4 z5 size 
Categorical characteristics: z1 z3 grade3 
## Prior to lasso: z2 z4 z5 size 
#### Lasso selection results 
7 x 1 sparse Matrix of class "dgCMatrix"
               s0
z1      .        
z2      .        
z3     -0.1081342
z4     -0.2654241
z5      0.4210064
size    .        
grade3  .        
Cox-LASSO selected: z3 z4 z5 
Cox-LASSO not selected: z1 z2 size grade3 
### End Lasso selection 
## After lasso: z4 z5 
Default cuts included from Lasso: z4 <= mean(z4) z4 <= median(z4) z4 <= qlow(z4) z4 <= qhigh(z4) z5 <= mean(z5) z5 <= median(z5) z5 <= qlow(z5) z5 <= qhigh(z5) 
Categorical after Lasso: z3 
Factors per GRF:  
Initial GRF cuts included  
Factors included per GRF (not in lasso)  

===== CONSOLIDATED CUT EVALUATION (IMPROVED) =====
Evaluating 8 cut expressions once and caching...
Cut evaluation summary:
  Total cuts:  8 
  Valid cuts:  8 
  Errors:  0 
✓ All 8 factors validated as 0/1
===== END CONSOLIDATED CUT EVALUATION =====

# of candidate subgroup factors= 8 
[1] "z4 <= 2.5" "z4 <= 3"   "z4 <= 2"   "z5 <= 2.5" "z5 <= 2"   "z5 <= 1.8"
[7] "z5 <= 3"   "z3"       
Number of possible configurations (<= maxk): maxk = 2 , # combinations = 136 
Events criteria: control >= 12 , treatment >= 12 
Sample size criteria: n >= 60 
Subgroup search completed in 0 minutes

--- Filtering Summary ---
  Combinations evaluated: 136 
  Passed variance check: 117 
  Passed prevalence (>= 0.025 ): 117 
  Passed redundancy check: 106 
  Passed event counts (d0>= 12 , d1>= 12 ): 98 
  Passed sample size (n>= 60 ): 98 
  Cox model fit successfully: 98 
  Passed HR threshold (>= 1.25 ): 0 
-------------------------

NO subgroup candidates found
tau, maxdepth = 49.50766 2 
  leaf.node control.mean control.size control.se depth
1         2        -4.62       689.00       1.07     1
2         5        -6.72       124.00       2.32     2
3         6        -5.20       513.00       1.27     2
GRF subgroup NOT found

=== Two-Stage Consistency Evaluation Enabled ===
Stage 1 screening splits: 30 
Maximum total splits: 400 
Batch size: 20 
================================================

GRF stage for cut selection with dmin, tau = 12 0.6 
tau, maxdepth = 46.7094 2 
  leaf.node control.mean control.size control.se depth
1         2         2.55       101.00       2.51     1
2         3        -5.13       599.00       1.05     1
3         4         3.54        91.00       2.44     2
4         5        -6.03       449.00       1.13     2
5         6        -7.26       105.00       2.90     2
GRF subgroup NOT found
GRF cuts identified: 0 
# of continuous/categorical characteristics 4 3 
Continuous characteristics: z2 z4 z5 size 
Categorical characteristics: z1 z3 grade3 
## Prior to lasso: z2 z4 z5 size 
#### Lasso selection results 
7 x 1 sparse Matrix of class "dgCMatrix"
                  s0
z1     -0.2046970204
z2      .           
z3     -0.0162003130
z4     -0.3904744742
z5      0.3456587351
size    .           
grade3  0.0002404273
Cox-LASSO selected: z1 z3 z4 z5 grade3 
Cox-LASSO not selected: z2 size 
### End Lasso selection 
## After lasso: z4 z5 
Default cuts included from Lasso: z4 <= mean(z4) z4 <= median(z4) z4 <= qlow(z4) z4 <= qhigh(z4) z5 <= mean(z5) z5 <= median(z5) z5 <= qlow(z5) z5 <= qhigh(z5) 
Categorical after Lasso: z1 z3 grade3 
Factors per GRF:  
Initial GRF cuts included  
Factors included per GRF (not in lasso)  

===== CONSOLIDATED CUT EVALUATION (IMPROVED) =====
Evaluating 10 cut expressions once and caching...
Cut evaluation summary:
  Total cuts:  10 
  Valid cuts:  9 
  Errors:  0 
Dropping variables (cut only has 1 level): z4 <= 4 
Total cuts after dropping invalid: 9
✓ All 9 factors validated as 0/1
===== END CONSOLIDATED CUT EVALUATION =====

# of candidate subgroup factors= 9 
[1] "z4 <= 2.5" "z4 <= 2"   "z4 <= 1"   "z5 <= 2.5" "z5 <= 2"   "z5 <= 3"  
[7] "z1"        "z3"        "grade3"   
Number of possible configurations (<= maxk): maxk = 2 , # combinations = 171 
Events criteria: control >= 12 , treatment >= 12 
Sample size criteria: n >= 60 
Subgroup search completed in 0.01 minutes

--- Filtering Summary ---
  Combinations evaluated: 171 
  Passed variance check: 154 
  Passed prevalence (>= 0.025 ): 154 
  Passed redundancy check: 146 
  Passed event counts (d0>= 12 , d1>= 12 ): 141 
  Passed sample size (n>= 60 ): 135 
  Cox model fit successfully: 135 
  Passed HR threshold (>= 1.25 ): 0 
-------------------------

NO subgroup candidates found
tau, maxdepth = 46.7094 2 
  leaf.node control.mean control.size control.se depth
1         2         2.55       101.00       2.51     1
2         3        -5.13       599.00       1.05     1
3         4         3.54        91.00       2.44     2
4         5        -6.03       449.00       1.13     2
5         6        -7.26       105.00       2.90     2
GRF subgroup NOT found
runtime_null <- difftime(Sys.time(), start_time, units = "mins")
cat("Completed in", round(runtime_null, 1), "minutes\n")
Completed in 0.3 minutes

6 Summarizing Results

6.1 Operating Characteristics Summary

# Summarize alternative hypothesis results
summary_alt <- summarize_simulation_results(results_alt)
print(summary_alt)
                  FS     GRF
any.H          0.900   0.730
sens           0.870   0.960
spec           0.990   0.980
ppv            0.900   0.890
npv            0.980   0.990
Avg(#H)       84.000  95.000
minH          65.000  68.000
maxH         140.000 143.000
Avg(#Hc)     624.000 630.000
minHc        560.000 557.000
maxHc        700.000 700.000
hat(H*)        2.566     NaN
hat(hat[H])    2.568   2.444
hat(Hc*)       0.639     NaN
hat(hat[Hc])   0.647   0.638
hat(H*)all     2.566     NaN
hat(Hc*)all    0.639     NaN
hat(ITT)       0.756   0.756
hat(ITTadj)    0.772   0.772
# Summarize null hypothesis results
summary_null <- summarize_simulation_results(results_null)
print(summary_null)
                  FS     GRF
any.H          0.100   0.030
sens             NaN     NaN
spec           0.870   0.880
ppv            0.000   0.000
npv            1.000   1.000
Avg(#H)       94.000  81.000
minH          81.000  81.000
maxH         103.000  81.000
Avg(#Hc)     691.000 697.000
minHc        597.000 619.000
maxHc        700.000 700.000
hat(H*)          NaN     NaN
hat(hat[H])    1.703   1.688
hat(Hc*)       0.813     NaN
hat(hat[Hc])   0.727   0.731
hat(H*)all       NaN     NaN
hat(Hc*)all    0.813     NaN
hat(ITT)       0.710   0.710
hat(ITTadj)    0.703   0.703

7 Theoretical Subgroup Detection Rate Approximation

The function compute_detection_probability() provides an analytical approximation based on asymptotic normal theory:


# =============================================================================
# Theoretical Detection Probability Analysis
# =============================================================================

# Calculate expected subgroup characteristics
n_sg_expected <- sim_config_alt$n_sample * mean(dgm_calibrated$df_super_rand$flag.harm)
prop_cens <- mean(results_alt$p.cens)  # Censoring proportion

cat("=== Subgroup Characteristics ===\n")
=== Subgroup Characteristics ===
cat("Expected subgroup size (n_sg):", round(n_sg_expected), "\n")
Expected subgroup size (n_sg): 89 
cat("Censoring proportion:", round(prop_cens, 3), "\n")
Censoring proportion: 0.457 
cat("True HR in H:", round(dgm_calibrated$hr_H_true, 3), "\n")
True HR in H: 2 
cat("HR threshold:", fs_params$hr.threshold, "\n")
HR threshold: 1.25 
# -----------------------------------------------------------------------------
# Single-Point Detection Probability
# -----------------------------------------------------------------------------

# The true HR in H is dgm_calibrated$hr_H_true; ideally we would evaluate
# at the probability limit (plim) of the observed estimate instead:
# plim_hr_hatH <- c(summary_alt[c("hat(hat[H])"), 1])

dgm_calibrated$hr_H_true
   treat 
1.999999 
# Compute detection probability at the true HR
prob_detect <- compute_detection_probability(
  theta = dgm_calibrated$hr_H_true,
  n_sg = round(n_sg_expected),
  prop_cens = prop_cens,
  hr_threshold = fs_params$hr.threshold,
  hr_consistency = fs_params$hr.consistency,
  method = "cubature"
)

# Compare theoretical to empirical (alternative)
cat("\n=== Detection Probability Comparison ===\n")

=== Detection Probability Comparison ===
cat("Theoretical FS (asymptotic):", round(prob_detect, 3), "\n")
Theoretical FS (asymptotic): 0.898 
cat("Empirical FS:", round(mean(results_alt[analysis == "FS"]$any.H), 3), "\n")
Empirical FS: 0.9 
cat("Empirical FSlg:", round(mean(results_alt[analysis == "FSlg"]$any.H), 3), "\n")
Empirical FSlg: NaN 
if ("GRF" %in% results_alt$analysis) {
  cat("Empirical GRF:", round(mean(results_alt[analysis == "GRF"]$any.H), 3), "\n")
}
Empirical GRF: 0.733 
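
As a rough cross-check of compute_detection_probability(), the normal approximation underlying it can be sketched directly. This is a simplified sketch, not the package's cubature implementation: detect_prob_normal() is a hypothetical helper, and it ignores the consistency criterion, so it slightly overstates the detection probability.

```r
# Back-of-envelope normal approximation: with 1:1 randomization, the
# subgroup log-HR estimate is roughly Normal(log(theta), 4/d), where
# d ~ n_sg * (1 - prop_cens) is the expected event count (Schoenfeld's
# approximation). Detection requires the estimated HR to exceed the
# threshold; the consistency criterion is ignored here.
detect_prob_normal <- function(theta, n_sg, prop_cens, hr_threshold) {
  d  <- n_sg * (1 - prop_cens)   # expected events in the subgroup
  se <- sqrt(4 / d)              # SE of the log-HR estimate
  pnorm((log(theta) - log(hr_threshold)) / se)
}
detect_prob_normal(theta = 2, n_sg = 89, prop_cens = 0.457, hr_threshold = 1.25)
# ~0.95, a bit above the cubature value of 0.898, as expected
```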
# Null 

#plim_hr_itt <- c(summary_alt[c("hat(ITT)all"),1])

# Under the null, compute the detection probability at the overall causal HR,
# evaluated at the minimum allowed subgroup size
prob_detect_null <- compute_detection_probability(
  theta = dgm_null$hr_causal,
  n_sg = fs_params$n.min,
  prop_cens = prop_cens,
  hr_threshold = fs_params$hr.threshold,
  hr_consistency = fs_params$hr.consistency,
  method = "cubature"
)


# Compare theoretical to empirical (null)
cat("\n=== Detection Probability Comparison ===\n")

=== Detection Probability Comparison ===
cat("Under the null, evaluating at minimum subgroup size:", fs_params$n.min, "\n")
Under the null, evaluating at minimum subgroup size: 60 
cat("Theoretical FS at min(SG) (asymptotic):", round(prob_detect_null, 6), "\n")
Theoretical FS at min(SG) (asymptotic): 0.04005 
cat("Empirical FS:", round(mean(results_null[analysis == "FS"]$any.H), 6), "\n")
Empirical FS: 0.1 
cat("Empirical FSlg:", round(mean(results_null[analysis == "FSlg"]$any.H), 6), "\n")
Empirical FSlg: NaN 
if ("GRF" %in% results_null$analysis) {
  cat("Empirical GRF:", round(mean(results_null[analysis == "GRF"]$any.H), 6), "\n")
}
Empirical GRF: 0.033333 
prop_cens <- mean(results_null$p.cens)  # Censoring proportion
cat("Censoring proportion:", round(prop_cens, 3), "\n")
Censoring proportion: 0.47 
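
With only 30 simulated trials, empirical rates carry substantial Monte Carlo error, so the gap between the empirical 0.100 and the theoretical 0.040 is not alarming. A quick binomial standard-error sketch:

```r
# Monte Carlo SE of an empirical rate p_hat over n_sims independent trials:
# sqrt(p_hat * (1 - p_hat) / n_sims)
mc_se <- sqrt(0.10 * (1 - 0.10) / 30)
round(mc_se, 3)
# 0.055: the interval 0.100 +/- 2 * 0.055 easily covers the theoretical 0.040
```

Increasing n_sims (e.g., to 500 as in the complete example below) shrinks this error proportionally to 1/sqrt(n_sims).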
# -----------------------------------------------------------------------------
# Generate Full Detection Curve
# -----------------------------------------------------------------------------

# Generate detection probability curve across HR values
detection_curve <- generate_detection_curve(
  theta_range = c(0.5, 3.0),
  n_points = 50,
  n_sg = round(n_sg_expected),
  prop_cens = prop_cens,
  hr_threshold = fs_params$hr.threshold,
  hr_consistency = fs_params$hr.consistency,
  include_reference = TRUE,
  verbose = FALSE
)

# -----------------------------------------------------------------------------
# Visualization
# -----------------------------------------------------------------------------

# Plot detection curve with empirical overlay
plot_detection_curve(
  detection_curve,
  add_reference_lines = TRUE,
  add_threshold_line = TRUE,
  title = sprintf(
    "Detection Probability Curve (n=%d, cens=%.0f%%, threshold=%.2f)",
    round(n_sg_expected), 100 * prop_cens, fs_params$hr.threshold
  )
)

# Add empirical results as points
empirical_rates <- c(
  FS = mean(results_alt[analysis == "FS"]$any.H),
  FSlg = mean(results_alt[analysis == "FSlg"]$any.H)
)
if ("GRF" %in% results_alt$analysis) {
  empirical_rates["GRF"] <- mean(results_alt[analysis == "GRF"]$any.H)
}

# Mark the true HR and empirical detection rates
points(
  x = rep(dgm_calibrated$hr_H_true, length(empirical_rates)),
  y = empirical_rates,
  pch = c(16, 17, 18)[1:length(empirical_rates)],
  col = c("blue", "darkgreen", "purple")[1:length(empirical_rates)],
  cex = 1.5
)

# Add vertical line at true HR
abline(v = dgm_calibrated$hr_H_true, lty = 2, col = "blue", lwd = 1)

# Legend for empirical points
legend(
  "topleft",
  legend = c(
    sprintf("H true = %.2f", dgm_calibrated$hr_H_true),
    paste(names(empirical_rates), "=", round(empirical_rates, 3))
  ),
  pch = c(NA, 16, 17, 18)[1:(length(empirical_rates) + 1)],
  lty = c(2, rep(NA, length(empirical_rates))),
  col = c("blue", "blue", "darkgreen", "purple")[1:(length(empirical_rates) + 1)],
  cex = 0.8,
  bty = "n"
)

7.1 AHR Metrics in Results (New)

The aligned analysis functions now compute AHR estimates in addition to Cox-based HRs:

# Check for AHR columns in results
ahr_cols <- grep("ahr", names(results_alt), value = TRUE)
cat("AHR columns in results:", paste(ahr_cols, collapse = ", "), "\n\n")
AHR columns in results:  
if (length(ahr_cols) > 0) {
  # Summarize AHR estimates
  results_found <- results_alt[results_alt$any.H == 1, ]
  
  if (nrow(results_found) > 0 && "ahr.H.hat" %in% names(results_found)) {
    cat("AHR estimates (when subgroup found):\n")
    cat("  Mean AHR(H) estimated:", round(mean(results_found$ahr.H.hat, na.rm = TRUE), 3), "\n")
    cat("  Mean AHR(Hc) estimated:", round(mean(results_found$ahr.Hc.hat, na.rm = TRUE), 3), "\n")
    cat("  True AHR(H):", round(dgm_calibrated$AHR_H_true, 3), "\n")
    cat("  True AHR(Hc):", round(dgm_calibrated$AHR_Hc_true, 3), "\n")
  }
}
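
For intuition, per-subject log hazard-ratio potential outcomes can be aggregated into a subgroup-level AHR. The sketch below uses one plausible definition (a geometric mean of per-subject HRs) with hypothetical values; the package's AHR may be defined differently (e.g., via the stacked potential outcomes mentioned above).

```r
# Illustrative aggregation of per-subject log hazard ratios (loghr_po)
# within H into an average hazard ratio via a geometric mean.
loghr_po_H <- c(0.80, 0.60, 0.75, 0.70)   # hypothetical values within H
exp(mean(loghr_po_H))
# ~2.04
```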

7.2 Formatted Tables

# Format operating characteristics for H1
format_oc_results(
  results = results_alt,
  title = "Operating Characteristics (Alternative Hypothesis)",
  subtitle = sprintf("n = %d, %d simulations, HR(H) = %.2f",
                     sim_config_alt$n_sample,
                     sim_config_alt$n_sims,
                     dgm_calibrated$hr_H_true),
  use_gt = TRUE
)
Operating Characteristics (Alternative Hypothesis)
n = 700, 30 simulations, HR(H) = 2.00

| Method | N Sims | Detection | Sen   | Spec  | PPV   | NPV   | HR_H_hat | HR_Hc_hat | HR_H_true | HR_Hc_true | HR_ITT | Size_H_mean | Size_H_min | Size_H_max |
|--------|--------|-----------|-------|-------|-------|-------|----------|-----------|-----------|------------|--------|-------------|------------|------------|
| FS     | 30     | 0.900     | 0.872 | 0.985 | 0.897 | 0.982 | 2.568    | 0.647     | 2.566     | 0.639      | 0.756  | 84          | 65         | 140        |
| GRF    | 30     | 0.733     | 0.960 | 0.977 | 0.889 | 0.994 | 2.444    | 0.638     | -         | -          | 0.756  | 95          | 68         | 143        |
# Format operating characteristics for H0
format_oc_results(
  results = results_null,
  title = "Type I Error (Null Hypothesis)",
  subtitle = sprintf("n = %d, %d simulations, HR(overall) = %.2f",
                     sim_config_null$n_sample,
                     sim_config_null$n_sims,
                     dgm_null$hr_causal),
  use_gt = TRUE
)
Type I Error (Null Hypothesis)
n = 700, 30 simulations, HR(overall) = 0.72

| Method | N Sims | Detection | Sen | Spec  | PPV   | NPV   | HR_H_hat | HR_Hc_hat | HR_H_true | HR_Hc_true | HR_ITT | Size_H_mean | Size_H_min | Size_H_max |
|--------|--------|-----------|-----|-------|-------|-------|----------|-----------|-----------|------------|--------|-------------|------------|------------|
| FS     | 30     | 0.100     | -   | 0.865 | 0.000 | 1.000 | 1.703    | 0.727     | -         | 0.813      | 0.710  | 94          | 81         | 103        |
| GRF    | 30     | 0.033     | -   | 0.884 | 0.000 | 1.000 | 1.688    | 0.731     | -         | -          | 0.710  | 81          | 81         | 81         |

7.3 Key Metrics

# Extract key metrics
cat("=== KEY OPERATING CHARACTERISTICS ===\n\n")
=== KEY OPERATING CHARACTERISTICS ===
cat("Alternative Hypothesis (H1):\n")
Alternative Hypothesis (H1):
for (meth in unique(results_alt$analysis)) {
  # Use a loop variable that does not shadow the `analysis` column:
  # inside data.table's `[`, a bare `analysis` resolves to the column,
  # so `results_alt[analysis == analysis]` would select every row.
  res <- results_alt[analysis == meth]
  cat(sprintf("  %s: Power = %.3f, Sens = %.3f, Spec = %.3f, PPV = %.3f\n",
              meth,
              mean(res$any.H),
              mean(res$sens, na.rm = TRUE),
              mean(res$spec, na.rm = TRUE),
              mean(res$ppv, na.rm = TRUE)))
}
  FS: Power = 0.900, Sens = 0.872, Spec = 0.985, PPV = 0.897
  GRF: Power = 0.733, Sens = 0.960, Spec = 0.977, PPV = 0.889
cat("\nNull Hypothesis (H0):\n")

Null Hypothesis (H0):
for (meth in unique(results_null$analysis)) {
  res <- results_null[analysis == meth]
  cat(sprintf("  %s: Type I Error = %.4f\n",
              meth,
              mean(res$any.H)))
}
  FS: Type I Error = 0.1000
  GRF: Type I Error = 0.0333

8 Advanced Topics

8.1 Comparing Standard vs Two-Stage Algorithm

# Run simulations with two-stage algorithm for comparison
results_twostage <- foreach(
  sim = 1:100,
  .combine = rbind,
  .options.future = list(
    packages = c("forestsearch", "survival", "data.table"),
    seed = TRUE
  )
) %dofuture% {
  
  run_simulation_analysis(
    sim_id = sim,
    dgm = dgm_calibrated,
    n_sample = sim_config_alt$n_sample,
    confounders_base = confounders_base,
    run_fs = TRUE,
    run_fs_grf = FALSE,
    run_grf = FALSE,
    fs_params = fs_params_fast,  # use_twostage = TRUE
    verbose = FALSE
  )
}

# Compare detection rates
cat("Standard algorithm power:", round(mean(results_alt$any.H[results_alt$analysis == "FS"]), 3), "\n")
cat("Two-stage algorithm power:", round(mean(results_twostage$any.H), 3), "\n")
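
The two-stage mechanics reported in the earlier logs (30 screening splits, screening threshold 0.763, batches of 20 up to 400 total splits) can be sketched conceptually as follows. This is an illustration of the control flow only: twostage_sketch() and eval_split() are hypothetical stand-ins, not package functions.

```r
# Stage 1 runs a small number of screening splits per candidate and drops
# candidates whose consistency proportion falls below the screening
# threshold; survivors continue in batches up to the split budget.
# eval_split() stands in for one split-and-refit consistency check
# returning TRUE/FALSE.
twostage_sketch <- function(candidate, eval_split,
                            n_screen = 30, screen_threshold = 0.763,
                            max_splits = 400, batch_size = 20) {
  hits <- sum(replicate(n_screen, eval_split(candidate)))
  if (hits / n_screen < screen_threshold) return(NA)  # screened out, stage 1
  total <- n_screen
  while (total < max_splits) {
    b     <- min(batch_size, max_splits - total)
    hits  <- hits + sum(replicate(b, eval_split(candidate)))
    total <- total + b
  }
  hits / total  # final consistency proportion over all splits
}
```

Screening out weak candidates after 30 splits, rather than 400, is what makes the use_twostage option faster for exploratory runs.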

8.2 Adding Noise Variables

Test ForestSearch robustness by including irrelevant noise variables:

# Run simulations with noise variables
results_noise <- foreach(
  sim = 1:sim_config_alt$n_sims,
  .combine = rbind,
  .errorhandling = "remove",
  .options.future = list(
    packages = c("forestsearch", "survival", "data.table"),
    seed = TRUE
  )
) %dofuture% {
  
  run_simulation_analysis(
    sim_id = sim,
    dgm = dgm_calibrated,
    n_sample = sim_config_alt$n_sample,
    confounders_base = confounders_base,
    n_add_noise = 10,  # Add 10 noise variables
    run_fs = TRUE,
    fs_params = fs_params,
    verbose = FALSE
  )
}

# Compare detection rates (FS only, for a like-for-like comparison)
cat("Without noise:", round(mean(results_alt$any.H[results_alt$analysis == "FS"]), 3), "\n")
cat("With 10 noise vars:", round(mean(results_noise$any.H), 3), "\n")

8.3 Sensitivity Analysis: Varying Parameters

# Test different HR thresholds
thresholds <- c(1.10, 1.25, 1.50)
results_by_thresh <- list()

for (thresh in thresholds) {
  results_by_thresh[[as.character(thresh)]] <- foreach(
    sim = 1:100,
    .combine = rbind,
    .options.future = list(
      packages = c("forestsearch", "survival", "data.table"),
      seed = TRUE
    )
  ) %dofuture% {
    
    run_simulation_analysis(
      sim_id = sim,
      dgm = dgm_calibrated,
      n_sample = sim_config_alt$n_sample,
      confounders_base = confounders_base,
      run_fs = TRUE,
      fs_params = modifyList(fs_params, list(hr.threshold = thresh)),
      verbose = FALSE
    )
  }
  results_by_thresh[[as.character(thresh)]]$threshold <- thresh
}

# Combine and summarize
combined <- rbindlist(results_by_thresh)
combined[, .(power = mean(any.H), ppv = mean(ppv, na.rm = TRUE)), 
         by = .(threshold, analysis)]

9 Saving Results

# Save simulation results for later use
save_simulation_results(
  results = results_alt,
  dgm = dgm_calibrated,
  summary_table = summary_alt,
  runtime_hours = as.numeric(runtime_alt) / 60,
  output_file = "forestsearch_simulation_alt.Rdata",
  # Include AHR metrics in saved output
  ahr_metrics = list(
    AHR_H_true = dgm_calibrated$AHR_H_true,
    AHR_Hc_true = dgm_calibrated$AHR_Hc_true,
    AHR = dgm_calibrated$AHR
  )
)

save_simulation_results(
  results = results_null,
  dgm = dgm_null,
  summary_table = summary_null,
  runtime_hours = as.numeric(runtime_null) / 60,
  output_file = "forestsearch_simulation_null.Rdata"
)

10 Complete Example Script

Here’s a minimal self-contained script for running a simulation study:

# ===========================================================================
# Complete ForestSearch Simulation Study - Minimal Example (Aligned)
# ===========================================================================

library(forestsearch)
library(data.table)
library(survival)
library(foreach)
library(doFuture)

# --- Configuration ---
N_SIMS <- 500
N_SAMPLE <- 500
TARGET_HR_HARM <- 1.5

# --- Setup parallel processing ---
plan(multisession, workers = 4)
registerDoFuture()

# --- Create DGM ---
# Option 1: Calibrate to Cox-based HR
k_inter <- calibrate_k_inter(target_hr_harm = TARGET_HR_HARM, 
                             use_ahr = FALSE, verbose = TRUE)

# Option 2: Calibrate to AHR instead
# k_inter <- calibrate_k_inter(target_hr_harm = TARGET_HR_HARM, 
#                              use_ahr = TRUE, verbose = TRUE)

dgm <- create_gbsg_dgm(model = "alt", k_inter = k_inter, verbose = TRUE)

# Verify hazard ratios (new aligned output)
cat("\nDGM Summary:\n")
cat("  Cox HR(H):", round(dgm$hr_H_true, 3), "\n")
cat("  AHR(H):", round(dgm$AHR_H_true, 3), "\n")
cat("  Cox HR(Hc):", round(dgm$hr_Hc_true, 3), "\n")
cat("  AHR(Hc):", round(dgm$AHR_Hc_true, 3), "\n")

# --- Run simulations ---
confounders <- c("v1", "v2", "v3", "v4", "v5", "v6", "v7")

results <- foreach(
  sim = 1:N_SIMS, 
  .combine = rbind,
  .options.future = list(
    packages = c("forestsearch", "survival", "data.table"),
    seed = TRUE
  )
) %dofuture% {
  run_simulation_analysis(
    sim_id = sim,
    dgm = dgm,
    n_sample = N_SAMPLE,
    max_follow = 60,
    confounders_base = confounders,
    run_fs = TRUE,
    run_fs_grf = TRUE,
    run_grf = FALSE,
    fs_params = list(
      hr.threshold = 1.25, 
      fs.splits = 300, 
      maxk = 2,
      use_twostage = FALSE  # Set TRUE for faster analysis
    )
  )
}

# --- Summarize ---
summary_table <- summarize_simulation_results(results)
print(summary_table)

# --- Display formatted table ---
format_oc_results(results = results, title = sprintf("Operating Characteristics (n=%d)", N_SAMPLE))

# --- Report AHR metrics (new) ---
results_found <- results[results$any.H == 1, ]
if (nrow(results_found) > 0 && "ahr.H.hat" %in% names(results_found)) {
  cat("\nAHR Estimates:\n")
  cat("  True AHR(H):", round(dgm$AHR_H_true, 3), "\n")
  cat("  Mean estimated AHR(H):", round(mean(results_found$ahr.H.hat, na.rm = TRUE), 3), "\n")
}

11 Summary

This vignette demonstrated the complete workflow for evaluating ForestSearch performance through simulation:

| Step | Function | Purpose |
|------|----------|---------|
| 1. Create DGM | create_gbsg_dgm() | Define data generating mechanism |
| 2. Calibrate | calibrate_k_inter() | Achieve target subgroup HR (Cox or AHR) |
| 3. Validate | validate_k_inter_effect() | Verify heterogeneity control |
| 4. Simulate | simulate_from_gbsg_dgm() | Generate trial data |
| 5. Analyze | run_simulation_analysis() | Run ForestSearch/GRF |
| 6. Summarize | summarize_simulation_results() | Aggregate metrics |
| 7. Display | format_oc_results() | Create gt tables |

Key metrics to report:

  • Power (H1) / Type I Error (H0): Subgroup detection rate
  • Sensitivity: P(identified | true harm subgroup)
  • Specificity: P(not identified | true complement)
  • PPV: P(true harm | identified)
  • NPV: P(true complement | not identified)
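
These four metrics follow directly from per-subject classification counts within a trial. The counts below are hypothetical, chosen only to illustrate the arithmetic:

```r
# tp = truly harmed subjects placed in the identified subgroup H-hat,
# fp = complement subjects placed in H-hat, fn = harmed subjects missed,
# tn = complement subjects correctly excluded (hypothetical counts, n = 700)
tp <- 78; fp <- 9; fn <- 12; tn <- 601
c(sens = tp / (tp + fn),   # ~0.867
  spec = tn / (tn + fp),   # ~0.985
  ppv  = tp / (tp + fp),   # ~0.897
  npv  = tn / (tn + fn))   # ~0.980
```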

New aligned features:

  • AHR metrics: Alternative to Cox-based HR (from loghr_po)
  • use_ahr calibration: Calibrate to AHR instead of Cox HR
  • use_twostage: Faster two-stage search algorithm option
  • Individual effects: Access theta_0, theta_1, loghr_po per subject

12 Session Info

sessionInfo()
R version 4.5.1 (2025-06-13)
Platform: aarch64-apple-darwin20
Running under: macOS Tahoe 26.2

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] doFuture_1.2.0          future_1.69.0           foreach_1.5.2          
[4] gt_1.3.0                ggplot2_4.0.1           survival_3.8-6         
[7] data.table_1.18.0       weightedsurv_0.1.0      forestsearch_0.0.0.9000

loaded via a namespace (and not attached):
 [1] sass_0.4.10          generics_0.1.4       xml2_1.4.0          
 [4] shape_1.4.6.1        stringi_1.8.7        lattice_0.22-7      
 [7] cubature_2.1.4-1     listenv_0.9.1        digest_0.6.37       
[10] magrittr_2.0.4       grf_2.5.0            evaluate_1.0.5      
[13] grid_4.5.1           RColorBrewer_1.1-3   iterators_1.0.14    
[16] policytree_1.2.3     fastmap_1.2.0        jsonlite_2.0.0      
[19] Matrix_1.7-3         glmnet_4.1-10        scales_1.4.0        
[22] codetools_0.2-20     cli_3.6.5            rlang_1.1.6         
[25] parallelly_1.46.1    future.apply_1.20.1  splines_4.5.1       
[28] withr_3.0.2          yaml_2.3.10          tools_4.5.1         
[31] parallel_4.5.1       dplyr_1.1.4          globals_0.18.0      
[34] vctrs_0.6.5          R6_2.6.1             lifecycle_1.0.4     
[37] stringr_1.6.0        randomForest_4.7-1.2 fs_1.6.6            
[40] htmlwidgets_1.6.4    pkgconfig_2.0.3      progressr_0.18.0    
[43] pillar_1.11.1        gtable_0.3.6         glue_1.8.0          
[46] Rcpp_1.1.0           xfun_0.53            tibble_3.3.0        
[49] tidyselect_1.2.1     rstudioapi_0.17.1    knitr_1.51          
[52] farver_2.1.2         htmltools_0.5.8.1    rmarkdown_2.30      
[55] compiler_4.5.1       S7_0.2.0            

13 References

  1. León LF, Marceau-West CT, He W, et al. (2024). “Identifying Patient Subgroups with Differential Treatment Effects: A Forest Search Approach.” Statistics in Medicine.

  2. Athey S, Imbens GW. (2016). “Recursive partitioning for heterogeneous causal effects.” PNAS, 113(27):7353-7360.

  3. Wager S, Athey S. (2018). “Estimation and inference of heterogeneous treatment effects using random forests.” JASA, 113(523):1228-1242.